Sentence Clustering Using Continuous Vector Space Representation

نویسندگان

  • Mara Chinea-Rios
  • Germán Sanchis-Trilles
  • Francisco Casacuberta
چکیده

In this paper, we present a clustering approach based on the combined use of a continuous vector space representation of sentences and the k-means algorithm. The principal motivation of this proposal is to split a big heterogeneous corpus into clusters of similar sentences. We use the word2vec toolkit for obtaining the representation of a given word as a continuous vector space. We provide empirical evidence for proving that the use of our technique can lead to better clusters, in terms of intra-cluster perplexity and F1 score.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...

متن کامل

Extractive Summarization using Continuous Vector Space Models

Automatic summarization can help users extract the most important pieces of information from the vast amount of text digitized into electronic form everyday. Central to automatic summarization is the notion of similarity between sentences in text. In this paper we propose the use of continuous vector representations for semantically aware representations of sentences as a basis for measuring si...

متن کامل

s-Topological vector spaces

In this paper, we have dened and studied a generalized form of topological vector spaces called s-topological vector spaces. s-topological vector spaces are dened by using semi-open sets and semi-continuity in the sense of Levine. Along with other results, it is proved that every s-topological vector space is generalized homogeneous space. Every open subspace of an s-topological vector space is...

متن کامل

Approximation of fixed points for a continuous representation of nonexpansive mappings in Hilbert spaces

This paper introduces an implicit scheme for a   continuous representation of nonexpansive mappings on a closed convex subset of a Hilbert space with respect to a   sequence of invariant means defined on an appropriate space of bounded, continuous real valued functions of the semigroup.   The main result is to    prove the strong convergence of the proposed implicit scheme to the unique solutio...

متن کامل

Latent semantic sentence clustering for multi-document summarization

This thesis investigates the applicability of Latent Semantic Analysis (LSA) to sentence clustering for Multi-Document Summarization (MDS). In contrast to more shallow approaches like measuring similarity of sentences by word overlap in a traditional vector space model, LSA takes word usage patterns into account. So far LSA has been successfully applied to different Information Retrieval (IR) t...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015